Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensuring that Spark Uncaught exceptions are handled in the Snappy side and do not cause a system.exit #342

Merged
merged 5 commits into from
Sep 9, 2016

Conversation

hbhanawat
Copy link
Contributor

@hbhanawat hbhanawat commented Aug 25, 2016

Changes proposed in this pull request

SnappyUncaughtExceptionHandler is now set as the uncaught exception handler in the executor. SnappyUncaughtExceptionHandler exits the executor component unlike the SparkUncaughtExceptionHandler which does a System.exit.

While exiting the executor, check if snappy store is shutting down. If yes, there is no need to restart the executor.

Patch testing

Added a test that tests restarting of executors because of an uncaught thread exception.
Added a test that tests restarting of executors because of an executor going down because of a fatal error.

Other PRs

TIBCOSoftware/snappy-spark#2

@rishitesh
Copy link
Contributor

Looks ok to me. I suppose the cause of Uncaught exception is being tracked somewhere else ?

@kneeraj
Copy link

kneeraj commented Aug 26, 2016

I did this 36837ae

for the uncaught exception handler.

Please see if this is useful.

@hbhanawat
Copy link
Contributor Author

hbhanawat commented Aug 26, 2016

I suppose the cause of Uncaught exception is being tracked somewhere else ?

@rishitesh Please find it here TIBCOSoftware/snappy-spark#2

I did this 36837ae for the uncaught exception handler.

@kneeraj we need to handle the call to SparkUncaughtException inside the Taskrunner and hence Spark layer changes are needed.

@kneeraj
Copy link

kneeraj commented Aug 26, 2016

Yes. I remembered why I stopped there.

BTW for now it is ok but do you think @hbhanawat a cleaner way will be to delegate the behavior to the Scheduler backend and let them decide what to do with that and the default can be the current behavior in the base class.

@sumwale
Copy link
Contributor

sumwale commented Aug 26, 2016

I suppose the cause of Uncaught exception is being tracked somewhere else ?

@rishitesh Please find it here TIBCOSoftware/snappy-spark#2

The question is why a CSV formatting error results in an uncaught exception. Isn't that a normal exception that should be sent back to the driver? I think Neeraj mentioned some reason but cannot recall right now. We need to fix that reason first (no exit for uncaught exceptions is also required, of course).

@kneeraj
Copy link

kneeraj commented Aug 26, 2016

Hemant we should test the same import problem which we saw at citi. All the lines in the csv had some problems and the import failed and it closed the underlying channel between driver and executor and the IOException coming from there was then an uncaught exception which led to system.exit

…andler in the executor. SnappyUncaughtExceptionHandler exits the executor component unlike the SparkUncaughtExceptionHandler which does a System.exit.

Added a new test to test the uncaughtexception handler.
@hbhanawat
Copy link
Contributor Author

@kneeraj I tried reproducing the csv issue that you mentioned by specifying incorrect schema but could not. The task and job fails gracefully.

@sumwale as discussed in the meeting, you can have a look at the issue because I don't have context and I am not sure how to repro the channel close issue. This PR handles the uncaught exception issue. We can keep the JIRA open after this PR is fixed to track the channel close issue.

@sumwale @rishitesh @kneeraj please re-review the new code.

@kneeraj
Copy link

kneeraj commented Sep 6, 2016

Looks ok to me. ( Some conflicts have come )
But we should file a ticket a track the issue of uncaughtexception when spark csv import was failing due to unexpected format in the csv fields and track that.

@hbhanawat hbhanawat merged commit 2295056 into master Sep 9, 2016
@hbhanawat hbhanawat deleted the SNAP-846 branch September 9, 2016 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants